
    SPDF: Sparse Pre-training and Dense Fine-tuning for Large Language Models

    The pre-training and fine-tuning paradigm has contributed to a number of breakthroughs in Natural Language Processing (NLP). Instead of directly training on a downstream task, language models are first pre-trained on large datasets with cross-domain knowledge (e.g., Pile, MassiveText, etc.) and then fine-tuned on task-specific data (e.g., natural language generation, text summarization, etc.). Scaling the model and dataset size has helped improve the performance of LLMs, but unfortunately, this also leads to highly prohibitive computational costs. Pre-training LLMs often requires orders of magnitude more FLOPs than fine-tuning, and the model capacity often remains the same between the two phases. To achieve training efficiency with respect to training FLOPs, we propose to decouple the model capacity between the two phases and introduce Sparse Pre-training and Dense Fine-tuning (SPDF). In this work, we show the benefits of using unstructured weight sparsity to train only a subset of weights during pre-training (Sparse Pre-training) and then recover the representational capacity by allowing the zeroed weights to learn (Dense Fine-tuning). We demonstrate that we can induce up to 75% sparsity into a 1.3B-parameter GPT-3 XL model, resulting in a 2.5x reduction in pre-training FLOPs, without a significant loss in accuracy on the downstream tasks relative to the dense baseline. By rigorously evaluating multiple downstream tasks, we also establish a relationship between sparsity, task complexity, and dataset size. Our work presents a promising direction to train large GPT models at a fraction of the training FLOPs using weight sparsity, while retaining the benefits of pre-trained textual representations for downstream tasks.
    Comment: Accepted to the Uncertainty in Artificial Intelligence (UAI) 2023 Conference; 13 pages, 4 figures (Main Paper) + 5 pages (Supplementary Material).
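    The core mechanic described above (keep only a fixed subset of weights active during pre-training, then drop the mask so the zeroed weights are free to learn during fine-tuning) can be sketched in a few lines. The following PyTorch snippet is an illustrative approximation under assumed choices (random per-weight masking, a single linear layer, a 75% sparsity level), not the authors' implementation.

```python
import torch
import torch.nn as nn

def apply_static_sparsity(linear: nn.Linear, sparsity: float = 0.75) -> torch.Tensor:
    """Zero a random fraction of the weights and return the binary keep-mask."""
    mask = (torch.rand_like(linear.weight) >= sparsity).float()
    with torch.no_grad():
        linear.weight.mul_(mask)
    return mask

layer = nn.Linear(512, 512)          # stand-in for one weight matrix of the LM
mask = apply_static_sparsity(layer, sparsity=0.75)
opt = torch.optim.AdamW(layer.parameters(), lr=1e-4)

# Sparse pre-training: after each optimizer step, re-apply the mask so the
# pruned weights stay at zero and carry no effective updates.
for _ in range(10):                  # stand-in for the pre-training loop
    x = torch.randn(8, 512)
    loss = layer(x).pow(2).mean()    # stand-in for the language-modeling loss
    opt.zero_grad()
    loss.backward()
    opt.step()
    with torch.no_grad():
        layer.weight.mul_(mask)

# Dense fine-tuning: simply stop applying the mask; the previously zeroed
# weights can now receive gradients, restoring full representational capacity.
```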

    Towards Understanding Label Regularization for Fine-tuning Pre-trained Language Models

    Knowledge Distillation (KD) is a prominent neural model compression technique that relies heavily on teacher network predictions to guide the training of a student model. Given the ever-growing size of pre-trained language models (PLMs), KD is often adopted in NLP tasks involving PLMs. However, deploying the teacher network during training adds to the memory and computational requirements of training. In the computer vision literature, the necessity of the teacher network has been put under scrutiny by showing that KD is a label regularization technique that can be replaced with lighter teacher-free variants such as label smoothing. To the best of our knowledge, however, this issue has not been investigated in NLP. This work therefore studies different label regularization techniques and asks whether teacher labels are actually needed to fine-tune smaller PLM student networks on downstream tasks. To this end, we conducted a comprehensive set of experiments on different PLMs such as BERT, RoBERTa, and GPT, with more than 600 distinct trials, running each configuration five times. This investigation led to the surprising observation that KD and other label regularization techniques do not play any meaningful role over regular fine-tuning when the student model is pre-trained. We further explore this phenomenon in different settings of NLP and computer vision tasks and demonstrate that pre-training itself acts as a kind of regularization, making additional label regularization unnecessary.
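    The comparison at the heart of this study, full knowledge distillation versus a teacher-free label regularizer such as label smoothing, is easy to state in code. The sketch below is illustrative only; the temperature, smoothing factor, and loss weighting are assumed values, not settings reported in the paper.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, targets, T=2.0, alpha=0.5):
    """Knowledge distillation: match softened teacher probabilities plus hard labels."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, targets)
    return alpha * soft + (1 - alpha) * hard

def label_smoothing_loss(student_logits, targets, eps=0.1):
    """Teacher-free label regularization: smear eps of the target mass uniformly."""
    return F.cross_entropy(student_logits, targets, label_smoothing=eps)

# Toy usage: a pre-trained student would be fine-tuned with either objective;
# the reported finding is that, for pre-trained students, neither regularizer
# offers a meaningful gain over plain cross-entropy fine-tuning.
logits_s = torch.randn(4, 10)
logits_t = torch.randn(4, 10)
y = torch.randint(0, 10, (4,))
print(kd_loss(logits_s, logits_t, y).item(), label_smoothing_loss(logits_s, y).item())
```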

    Derivation of Mouse Haploid Trophoblast Stem Cells

    Summary: Trophoblast stem (TS) cells are increasingly used as a model system for studying placentation and placental disorders. However, practical limitations of genetic manipulation have posed challenges for genetic analysis using TS cells. Here, we report the generation of mouse parthenogenetic haploid TS cells (haTSCs) and show that supplementation with FGF4 and inhibition of Rho-associated protein kinase (ROCK) enable the maintenance of their haploidy and developmental potential. The resulting haTSCs have 20 chromosomes, exhibit typical expression features of TS cells, possess the multipotency to differentiate into specialized trophoblast cell types, and can chimerize E13.5 and term placentas. We also demonstrate the capability of the haTSCs to undergo genetic manipulation and facilitate genome-wide screening in the trophoblast lineage. We expect that haTSCs will offer a powerful tool for studying functional genomics and placental biology.
    In brief: Trophoblast stem (TS) cells are increasingly used as a model system for studying placentation and placental disorders. Here, Cui et al. report the generation of mouse haploid TS cells, which possess a wide extraembryonic developmental potential and can serve as a powerful tool for studying functional genomics and placental biology.
    Keywords: haploidy, trophoblast, stem cells, TS

    Loss‐of‐Function of p21‐Activated Kinase 2 Links BMP Signaling to Neural Tube Patterning Defects

    Abstract: Closure of the neural tube is a highly complex and coordinated process, the failure of which causes common birth defects. The serine/threonine kinase p21-activated kinase 2 (PAK2) is a critical regulator of cytoskeleton dynamics; however, its role in neurulation and in the pathogenesis of neural tube defects (NTDs) remains unclear. Here, the results show that Pak2−/− mouse embryos fail to develop dorsolateral hinge points (DLHPs) and exhibit craniorachischisis, a severe phenotype of NTDs. Pak2 knockout activates BMP signaling, which is involved in vertebrate bone formation. Single-cell transcriptomes reveal abnormal differentiation trajectories and transcriptional events in Pak2−/− mouse embryos during neural tube development. Two nonsynonymous mutations and one recurrent splice-site mutation in the PAK2 gene are identified in five human NTD fetuses, which exhibit attenuated PAK2 expression and upregulated BMP signaling in the brain. Mechanistically, PAK2 regulates Smad9 phosphorylation to inhibit BMP signaling and ultimately induce DLHP formation. Depletion of pak2a in zebrafish induces defects in the neural tube, which are partially rescued by overexpression of wild-type, but not mutant, PAK2. These findings demonstrate the conserved role of PAK2 in neurulation across multiple vertebrate species, highlighting the molecular pathogenesis of PAK2 mutations in NTDs.